Protecting Privacy Using k-Anonymity
Authors: Khaled El Emam and F. Dankar
Abstract
Objective: There is increasing pressure to share health information and even make it publicly available. However, such disclosures of personal health information raise serious privacy concerns. To alleviate such concerns, it is possible to anonymize the data before disclosure. One popular anonymization approach is k-anonymity. There have been no evaluations of the actual re-identification probability of k-anonymized data sets. Design: Through a simulation, we evaluated the re-identification risk of k-anonymization and three different improvements on three large data sets. Measurement: Re-identification probability is measured under two different re-identification scenarios. Information loss is measured by the commonly used discernability metric. Results: For one of the re-identification scenarios, k-anonymity consistently over-anonymizes data sets, with this over-anonymization being most pronounced with small sampling fractions. Over-anonymization results in excessive distortions to the data (i.e., high information loss), making the data less useful for subsequent analysis. We found that a hypothesis testing approach provided the best control over re-identification risk and reduced the extent of information loss compared to baseline k-anonymity. Conclusion: Guidelines are provided on when to use the hypothesis testing approach instead of baseline k-anonymity.
J Am Med Inform Assoc. 2008;15:627–637. DOI 10.1197/jamia.M2716.
Affiliations of the authors (El Emam and Dankar): Children’s Hospital of Eastern Ontario Research Institute (KEE, FD), Ottawa, Ontario, Canada; Pediatrics, Faculty of Medicine, University of Ottawa (KEE), Ottawa, Ontario, Canada. The author(s) declare that they have no competing interests. The authors thank Bradley Malin, Vanderbilt University, and Jean-Louis Tambay, Statistics Canada, for their detailed feedback and suggestions on earlier versions of this paper. The authors also thank the anonymous reviewers for many helpful suggestions. Correspondence: Khaled El Emam, Children’s Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, Ontario K1J 8L1, Canada; e-mail: [email protected]. Received for review: 01/09/08; accepted for publication: 05/21/08.
Introduction
The sharing of raw research data is believed to have many benefits, including making it easier for the research community to confirm published results, ensuring the availability of original data for meta-analysis, facilitating additional innovative analysis on the same data sets, getting feedback to improve data quality for on-going data collection efforts, achieving cost savings from not having to collect the same data multiple times by different research groups, minimizing the need for research participants to provide data repeatedly, facilitating linkage of research data sets with administrative records, and making data available for instruction and education [1–14]. Consequently, there are pressures to make such research data more generally available. For example, in January 2004 Canada was a signatory to the OECD Declaration on Access to Research Data from Public Funding [17]. This is intended to ensure that data generated through public funds are publicly accessible for researchers as much as possible [18]. To the extent that this is implemented, potentially more personal health data about Canadians will be made available to researchers worldwide. The European Commission has passed a regulation facilitating the sharing with external researchers of data collected by Community government agencies [19]. There is interest by the pharmaceutical industry and academia to share raw data from clinical trials [16,20].
Researchers in the future may have to disclose their data. The Canadian Medical Association Journal has recently contemplated requiring authors to make the full data set from their published studies available publicly online [3]. Similar calls for depositing raw data with published manuscripts have been made recently [2,5,7,20–22]. The Canadian Institutes of Health Research (CIHR) has a policy, effective on 1 January 2008, that requires making some data available with publications. The UK MRC policy on data sharing sets the expectation that data from their funded projects will be made publicly available [24]. The UK Economic and Social Research Council requires its funded projects to deposit data sets in the UK Data Archive (such projects generate health and lifestyle data on, for example, diet, reproduction, pain, and mental health) [25]. The European Research Council considers it essential that raw data be made available preferably immediately after publication, but not later than six months after publication [26]. The NIH in the US expects investigators seeking more than $500,000 per year in funding to include a data sharing plan (or explain why that is not possible) [27]. Courts, in criminal and civil cases, may compel disclosure of research data [11,28].
Such broad disclosures of health data pose significant privacy risks [38]. The risks are real, as demonstrated by recent successful re-identifications of individuals in publicly disclosed data sets (see the examples in Table 1).
One approach for protecting the identity of individuals when releasing or sharing sensitive health data is to anonymize it. A popular approach for data anonymization is k-anonymity. With k-anonymity an original data set containing personal health information can be transformed so that it is difficult for an intruder to determine the identity of the individuals in that data set. A k-anonymized data set has the property that each record is similar to at least k - 1 other records on the potentially identifying variables. For example, if k = 5 and the potentially identifying variables are age and gender, then a k-anonymized data set has at least 5 records for each value combination of age and gender. The most common implementations of k-anonymity use transformation techniques such as generalization, global recoding, and suppression. Any record in a k-anonymized data set has a maximum probability of 1/k of being re-identified. In practice, a data custodian would select a value of k commensurate with the re-identification probability they are willing to tolerate, that is, a threshold risk. Higher values of k imply a lower probability of re-identification, but also more distortion to the data, and hence greater information loss due to k-anonymization. In general, excessive anonymization can make the disclosed data less useful to the recipients because some analysis becomes impossible or the analysis produces biased and incorrect results. Thus far there has been no empirical examination of how close the actual re-identification probability is to this maximum. Ideally, the actual re-identification probability of a k-anonymized data set would be close to 1/k since that balances the data custodian’s risk tolerance with the extent of distortion that is introduced due to k-anonymization. However, if the actual probability is much lower than 1/k then k-anonymity may be over-protective, and hence results in unnecessarily excessive distortions to the data.
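As a rough illustration of the definition above, the following Python sketch groups records by their potentially identifying variables, checks whether every group has at least k members, and reports the resulting maximum re-identification probability (one over the smallest group size). The record layout, the function names, and the 10-year age bands used for generalization are illustrative assumptions and are not taken from the paper. The discernability function uses the common sum-of-squared-group-sizes form and, as a simplification, ignores any penalty for suppressed records.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count the records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(n >= k for n in sizes.values())


def max_reidentification_probability(records, quasi_identifiers):
    """Upper bound on re-identification probability: 1 / smallest class size.

    Assumes a non-empty list of records.
    """
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return 1 / min(sizes.values())


def discernability(records, quasi_identifiers):
    """Simplified discernability metric: sum of squared class sizes (no suppression term)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return sum(n * n for n in sizes.values())


def generalize_age(record, width=10):
    """Hypothetical global recoding step: replace an exact age with an age band."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Toy data set (invented for illustration only).
records = [
    {"age": 31, "gender": "F"},
    {"age": 34, "gender": "F"},
    {"age": 38, "gender": "F"},
    {"age": 32, "gender": "M"},
    {"age": 35, "gender": "M"},
    {"age": 39, "gender": "M"},
]
qi = ["age", "gender"]

raw_ok = is_k_anonymous(records, qi, k=3)          # every record is unique
generalized = [generalize_age(r) for r in records]
gen_ok = is_k_anonymous(generalized, qi, k=3)      # two classes of three records
max_prob = max_reidentification_probability(generalized, qi)
loss_raw = discernability(records, qi)
loss_gen = discernability(generalized, qi)
```

On this toy data the generalized set is 3-anonymous with a maximum re-identification probability of 1/3, and its discernability score is higher than that of the raw data, which mirrors the trade-off between a larger k and greater information loss described above.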
In this paper we make explicit the two re-identification scenarios that k-anonymity protects against, and show that the actual probability of re-identification with k-anonymity is much lower than 1/k for one of these scenarios, resulting in excessive information loss. To address that problem, we evaluate three different modifications to k-anonymity and identify one that ensures that the actual risk is close to the threshold risk and that also reduces information loss considerably. The paper concludes with guidelines for deciding when to use the baseline versus the modified k-anonymity procedure. Following these guidelines will ensure that re-identification risk is controlled with minimal information loss when using k-anonymity.
Table 1. Some Examples of Re-identification Attempts
General Examples of Re-identification
AOL search data: AOL put anon… web site. Ne… search recor…
Chicago homicide database: Students were… homicide da…
Netflix movie recommendations: Individuals in… recommend… in a publicly…
Health-specific Examples of Re-identification
Re-identification of the medical record of the governor of Massachusetts: Data from the state employ… the governo…
Southern Illinoisan vs. The Department of Public Health [36]: An expert witn… neuroblastom… one of two a…
Canadian Adverse Event Database: A national bro… particular d… released by…
*The former type of data can contain health information (as in the c… sexual orientation information (as in the case of one of the individu…
Similar resources
Enhancing Informativeness in Data Publishing while Preserving Privacy using Coalitional Game Theory
k-Anonymity is one of the most popular conventional techniques for protecting the privacy of an individual. The shortcomings in the process of achieving k-Anonymity are presented and addressed by using Coalitional Game Theory (CGT) [1] and Concept Hierarchy Tree (CHT). The existing system considers information loss as a control parameter and provides anonymity level (k) as output. This paper pr...
Multi-dimensional k-anonymity Based on Mapping for Protecting Privacy
Data release carries a privacy disclosure risk if no protection policy is applied. Although attributes that clearly identify individuals, such as Name and Identity Number, are generally removed or decrypted, attackers can still link these databases with other released databases on attributes (quasi-identifiers) to re-identify an individual’s private information. K-anonymity is a significant method for priv...
P-Sensitive K-Anonymity with Generalization Constraints
Numerous privacy models based on the k‐anonymity property and extending the k‐anonymity model have been introduced in the last few years in data privacy research: l‐diversity, p‐sensitive k‐anonymity, (α, k)‐anonymity, t‐closeness, etc. While differing in their methods and quality of their results, they all focus first on masking the data, and then protecting the quality of the data as a wh...
A Customizable k-Anonymity Model for Protecting Location Privacy
Continued advances in mobile networks and positioning technologies have created a strong market push for location-based services (LBSs). Examples include location-aware emergency services, location based service advertisement, and location sensitive billing. One of the big challenges in wide deployment of LBS systems is the privacy-preserving management of location-based data. Without safeguard...
Quality Aware Privacy Protection for Location-Based Services
Protection of users’ privacy has been a central issue for location-based services (LBSs). In this paper, we classify two kinds of privacy protection requirements in LBS: location anonymity and identifier anonymity. While the location cloaking technique under the k-anonymity model can provide a good protection of users’ privacy, it reduces the resolution of location information and, hence, may d...
Achieving k-anonymity using Minimum Spanning Tree based Partitioning
Protecting individuals’ privacy has become a major concern among the privacy research community. Many frameworks and privacy principles were proposed for protecting the privacy of the data that is being released to the public for mining purposes. k-anonymization was the most popular among the proposed techniques in which the sensitive association between the sensitive attributes and their correspond...